fix: auto-fallback to new session when resume fails with "No conversation found" #709
fix: auto-fallback to new session when resume fails with "No conversation found" #709javamagong wants to merge 1 commit intositeboon:mainfrom
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughRefactors streaming resume/retry behavior: Changes
Sequence Diagram(s)sequenceDiagram
participant Client as WebSocket Client
participant Server as App Server
participant SDK as Claude SDK
participant Store as SessionStore
Client->>Server: start streaming query (may include resume id)
Server->>Store: lookup resume mapping
Server->>SDK: create or resume query iterator
SDK-->>Server: returns iterator or throws "no conversation"
alt resume failed ("no conversation" / stale)
Server->>Store: remove stale session mapping
Server->>Server: clear resume opts, set temp timeout, recreate query (retry)
Server->>SDK: new iterator (fresh session)
Server->>Store: add new session mapping (update sessionRef.id)
end
loop stream events
SDK-->>Server: stream chunk / token-budget / completion / error
Server->>Client: forward WS messages using sessionRef.id
Server->>Store: update token-budget/status and dedupe metadata using sessionRef.id
end
Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
server/openai-codex.js (1)
228-247:⚠️ Potential issue | 🔴 CriticalLine 247 resets
currentSessionIdto the stalesessionIdafter the fallback deliberately clears it.The resume fallback logic has two flaws:
Line 247 undermines the fallback (CRITICAL): After the catch handler at line 237 creates a new thread with
currentSessionId = null, line 247 unconditionally executes:currentSessionId = thread.id || sessionId || `codex-${Date.now()}`;The
@openai/codex-sdkdoes not populatethread.iduntil the backend assigns it via athread.startedevent (which fires duringrunStreamedat line 262). For a freshly created thread,thread.idis undefined, so this expression falls back to the stalesessionIdthe user was trying to resume:
activeCodexSessions.set(currentSessionId, ...)registers the new thread under the old ID- The
session_createdmessage at line 259 advertises the stale ID to the client- The sidebar's stale entry appears "active" and never receives the new thread ID
The entire fallback intent is defeated.
Error handling does not wrap
runStreamed(MAJOR): The inner try-catch (lines 230–243) only wrapsresumeThread(). The SDK does not validate thread existence synchronously; it spawns the CLI and validates duringrunStreamed()at line 262, which sits in the outer try-catch (line 15). When a missing thread error surfaces from the CLI, the error message won't match the "not found"/"No conversation" patterns the inner catch expects—the fallback never runs for the actual error scenario.Suggested fix for concern (1)
// Start or resume thread + let resumeFailed = false; if (sessionId) { try { thread = codex.resumeThread(sessionId, threadOptions); } catch (resumeError) { // Check if error is about thread not found const errorMsg = resumeError?.message || String(resumeError); - if (errorMsg.includes('thread') && (errorMsg.includes('not found') || errorMsg.includes('No conversation'))) { + if (/thread/i.test(errorMsg) && /(not found|no conversation)/i.test(errorMsg)) { console.log(`[WARN] Resume thread ${sessionId} not found, starting new thread`); thread = codex.startThread(threadOptions); - currentSessionId = null; // let SDK assign new ID + resumeFailed = true; } else { throw resumeError; } } } else { thread = codex.startThread(threadOptions); } - // Get the thread ID - currentSessionId = thread.id || sessionId || `codex-${Date.now()}`; + // Get the thread ID (avoid falling back to the stale sessionId after a failed resume) + currentSessionId = thread.id || (resumeFailed ? null : sessionId) || `codex-${Date.now()}`;For concern (2), extend the error catch around
runStreamedand subsequent event handling, or mirror therunSDKQueryLooppattern fromclaude-sdk.jsif you need to also handle delayed thread-not-found errors during streaming.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@server/openai-codex.js` around lines 228 - 247, The resume fallback currently resets currentSessionId back to the stale sessionId and only catches errors from codex.resumeThread(), which lets a delayed “thread not found” error from runStreamed bypass the fallback; update the logic around resumeThread/startThread/runStreamed so you never fall back to the original sessionId: after creating a new thread set currentSessionId only from thread.id (or a newly generated id) but do NOT use the incoming sessionId as a fallback; and expand the try-catch to also cover codex.runStreamed and subsequent event handling (the same error detection used in the resumeThread catch that checks resumeError.message for "thread" + "not found"/"No conversation") so you can detect a late missing-thread error, call codex.startThread(), set currentSessionId = null before starting a new thread, and re-run the stream/handler; reference resumeThread, startThread, runStreamed, activeCodexSessions, and the session_created emission when implementing these changes.
♻️ Duplicate comments (1)
server/openai-codex.js (1)
232-241:⚠️ Potential issue | 🟡 MinorBrittle error-message detection.
The matcher requires the literal lowercase substring
threadAND one ofnot found/No conversation. The Codex CLI surfaces multiple casings/phrasings (e.g.Thread not found,Conversation … not found) and these can drift with SDK upgrades. Prefer a case-insensitive regex — e.g./thread.*not found/iplus an alternative for/no conversation/i— and consider matching CLI error codes when available rather than free-text. Already covered by the diff in the previous comment.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@server/openai-codex.js` around lines 232 - 241, The current brittle check on resumeError uses plain substring matching for 'thread' and 'not found'/'No conversation'; replace it with a case-insensitive regex approach (and prefer error codes if available) to robustly detect "thread not found"/"no conversation" cases: update the logic around resumeError (the block that inspects resumeError.message) to test with regexes like /thread.*not found/i and /no conversation/i (or check any SDK error code property if present) and only then call codex.startThread(threadOptions) and clear currentSessionId/sessionId as before; leave all other error behavior (rethrowing resumeError) unchanged.
🧹 Nitpick comments (1)
server/claude-sdk.js (1)
562-763:sessionRefrefactor is consistently applied — LGTM.The switch from
capturedSessionIdto a mutablesessionRefis propagated cleanly through every WS payload (permission_request,permission_cancelled,complete,error), notifications, dedupe keys, and theNotificationhook closure. After retry, downstream consumers correctly read the new ID viaws.setSessionId(sessionRef.id)(consistent with theWebSocketWritercontract).One pre-existing minor: the
completeevent on line 726 usessessionRef.idwithout a fallback, so if the retry instance never yielded asession_idmessage (extremely unlikely but possible on early failure) thesessionIdwould beundefined. The error path on line 754 already guards against this withsessionRef.id || sessionId || null; consider the same guard on the success completion for symmetry.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@server/claude-sdk.js` around lines 562 - 763, The completion event uses sessionRef.id directly which can be undefined on very early success paths; in queryClaudeSDK update the createNormalizedMessage call that sends the 'complete' event so its sessionId uses a safe fallback (sessionRef.id || sessionId || null) — locate the ws.send({ kind: 'complete', ... }) call inside queryClaudeSDK and change only the sessionId field to use the same guard pattern used elsewhere (e.g., in 'error' and permission messages) to ensure consistent null-safe behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@server/claude-sdk.js`:
- Around line 480-516: The loop never emits a session_created when we fall back
to a new session because the guard checks !originalSessionId, so when a stale
originalSessionId exists the new sessionId is never sent; update the logic in
the message processing loop (the for-await over currentInstance where
sessionRef, originalSessionId and sessionCreatedSentRef are used) to detect a
retry/new-session case and force-send the session_created message: compare
sessionRef.id to originalSessionId (or track a local retried flag set when you
replace the stale instance) and, if they differ and sessionCreatedSentRef not
already set, call ws.send(createNormalizedMessage({ kind: 'session_created',
newSessionId: sessionRef.id, sessionId: sessionRef.id, provider: 'claude' }))
and set sessionCreatedSentRef; also convert sessionCreatedSentRef to a mutable
object (e.g., { sent: false }) or a closure-scoped boolean so reassignments
persist consistently across call sites.
- Around line 517-553: The stale-session detection using
loopError?.message?.includes('No conversation found with session ID') is
fragile—change it to a case-insensitive pattern match (e.g., /no conversation
found/i) and call processLogger.warn/error with the full loopError to surface
message drift; before calling removeSession(sessionRef.id) guard that
sessionRef.id is truthy and actually tracked (e.g., check your sessions Map or
hasSession(sessionRef.id)) so you don't silently delete the wrong id; simplify
the CLAUDE_CODE_STREAM_CLOSE_TIMEOUT handling by removing the
prevTimeout/save/restore branch in the retry path and unconditionally delete the
env var (or add a clarifying comment) since outer queryClaudeSDK already
restores it; keep the retry flow using query({ prompt, options: sdkOptions })
and preserve hasRetried/currentInstance behavior.
---
Outside diff comments:
In `@server/openai-codex.js`:
- Around line 228-247: The resume fallback currently resets currentSessionId
back to the stale sessionId and only catches errors from codex.resumeThread(),
which lets a delayed “thread not found” error from runStreamed bypass the
fallback; update the logic around resumeThread/startThread/runStreamed so you
never fall back to the original sessionId: after creating a new thread set
currentSessionId only from thread.id (or a newly generated id) but do NOT use
the incoming sessionId as a fallback; and expand the try-catch to also cover
codex.runStreamed and subsequent event handling (the same error detection used
in the resumeThread catch that checks resumeError.message for "thread" + "not
found"/"No conversation") so you can detect a late missing-thread error, call
codex.startThread(), set currentSessionId = null before starting a new thread,
and re-run the stream/handler; reference resumeThread, startThread, runStreamed,
activeCodexSessions, and the session_created emission when implementing these
changes.
---
Duplicate comments:
In `@server/openai-codex.js`:
- Around line 232-241: The current brittle check on resumeError uses plain
substring matching for 'thread' and 'not found'/'No conversation'; replace it
with a case-insensitive regex approach (and prefer error codes if available) to
robustly detect "thread not found"/"no conversation" cases: update the logic
around resumeError (the block that inspects resumeError.message) to test with
regexes like /thread.*not found/i and /no conversation/i (or check any SDK error
code property if present) and only then call codex.startThread(threadOptions)
and clear currentSessionId/sessionId as before; leave all other error behavior
(rethrowing resumeError) unchanged.
---
Nitpick comments:
In `@server/claude-sdk.js`:
- Around line 562-763: The completion event uses sessionRef.id directly which
can be undefined on very early success paths; in queryClaudeSDK update the
createNormalizedMessage call that sends the 'complete' event so its sessionId
uses a safe fallback (sessionRef.id || sessionId || null) — locate the ws.send({
kind: 'complete', ... }) call inside queryClaudeSDK and change only the
sessionId field to use the same guard pattern used elsewhere (e.g., in 'error'
and permission messages) to ensure consistent null-safe behavior.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 9324b9c2-a8dd-402b-878a-50334bd21b62
📒 Files selected for processing (2)
server/claude-sdk.jsserver/openai-codex.js
…found) fix: address CodeRabbit review feedback - Claude: fix session_created not emitted on retry (was gated by !originalSessionId) - Claude: use regex for error detection instead of brittle string matching - Claude: add null-safe guard on 'complete' event sessionId - Codex: fix currentSessionId falling back to stale sessionId after retry - Codex: use case-insensitive regex for error detection
f2179c4 to
4b8c68c
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
server/claude-sdk.js (1)
476-476:sessionsServiceis already module-imported — no need to thread it through as a parameter.
sessionsServiceis imported at line 27 and used directly elsewhere in this file. Passing it explicitly intorunSDKQueryLooponly adds noise (and risks the caller passing the wrong reference). Drop the trailing parameter and use the import directly inside the loop.♻️ Suggested cleanup
-async function runSDKQueryLoop(queryInstance, sdkOptions, prompt, sessionRef, originalSessionId, sessionCreatedSent, tempImagePaths, tempDir, ws, sessionsService) { +async function runSDKQueryLoop(queryInstance, sdkOptions, prompt, sessionRef, originalSessionId, sessionCreatedSent, tempImagePaths, tempDir, ws) {- await runSDKQueryLoop(queryInstance, sdkOptions, finalCommand, sessionRef, sessionId, sessionCreatedSent, tempImagePaths, tempDir, ws, sessionsService); + await runSDKQueryLoop(queryInstance, sdkOptions, finalCommand, sessionRef, sessionId, sessionCreatedSent, tempImagePaths, tempDir, ws);🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@server/claude-sdk.js` at line 476, The function signature for runSDKQueryLoop currently accepts sessionsService even though sessionsService is already imported at the top of the module; remove the trailing sessionsService parameter from runSDKQueryLoop's signature and stop passing it in every call site, then reference the module-level sessionsService import directly inside the function body (update any call in this file that passed sessionsService to omit that argument). Ensure you only modify the signature and callers named runSDKQueryLoop and do not shadow or re-declare sessionsService inside the function.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@server/claude-sdk.js`:
- Around line 527-538: The retry path in runSDKQueryLoop unconditionally sets
and deletes process.env.CLAUDE_CODE_STREAM_CLOSE_TIMEOUT, clobbering any
pre-existing value; fix by snapshotting the existing value (e.g., const
prevStreamTimeout = process.env.CLAUDE_CODE_STREAM_CLOSE_TIMEOUT) before setting
it to '300000', perform the retry construction (retryInstance = query({ prompt,
options: sdkOptions })) inside try/catch, and then restore the original value in
a finally-style step (if prevStreamTimeout is undefined delete the env var,
otherwise restore process.env.CLAUDE_CODE_STREAM_CLOSE_TIMEOUT =
prevStreamTimeout); apply this around both the initial try and the catch retry
so the env is always restored, referencing the existing retryInstance,
sdkOptions, and query(...) call and mirroring the snapshot/restore logic used by
queryClaudeSDK/prevStreamTimeout.
In `@server/openai-codex.js`:
- Around line 229-244: The current try/catch around codex.resumeThread(...) is
ineffective because validation happens later; move the "thread not found|no
conversation" regex handling into the error handler that wraps
thread.runStreamed(...) and the for-await loop over streamedTurn.events: on
matching that regex set resumeFailed = true, call thread =
codex.startThread(threadOptions), recompute currentSessionId, emit the
session_created event, and restart the streamed turn exactly once (i.e., retry
the runStreamed/for-await sequence once after rebuilding the thread), otherwise
rethrow the error; keep references to codex.resumeThread, codex.startThread,
thread.runStreamed, streamedTurn.events, resumeFailed, currentSessionId, and
session_created to locate and implement the change.
---
Nitpick comments:
In `@server/claude-sdk.js`:
- Line 476: The function signature for runSDKQueryLoop currently accepts
sessionsService even though sessionsService is already imported at the top of
the module; remove the trailing sessionsService parameter from runSDKQueryLoop's
signature and stop passing it in every call site, then reference the
module-level sessionsService import directly inside the function body (update
any call in this file that passed sessionsService to omit that argument). Ensure
you only modify the signature and callers named runSDKQueryLoop and do not
shadow or re-declare sessionsService inside the function.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 69e355eb-ea06-4c15-9ab4-ce7db66aadbd
📒 Files selected for processing (2)
server/claude-sdk.jsserver/openai-codex.js
| delete sdkOptions.resume; | ||
| process.env.CLAUDE_CODE_STREAM_CLOSE_TIMEOUT = '300000'; | ||
|
|
||
| let retryInstance; | ||
| try { | ||
| retryInstance = query({ prompt, options: sdkOptions }); | ||
| } catch (hookError) { | ||
| console.warn('Retry query with hooks failed:', hookError?.message || hookError); | ||
| delete sdkOptions.hooks; | ||
| retryInstance = query({ prompt, options: sdkOptions }); | ||
| } | ||
| delete process.env.CLAUDE_CODE_STREAM_CLOSE_TIMEOUT; |
There was a problem hiding this comment.
Retry path clobbers any user-set CLAUDE_CODE_STREAM_CLOSE_TIMEOUT instead of save/restore.
Outer queryClaudeSDK correctly snapshots prevStreamTimeout at line 680 and restores it at lines 700–705. The retry inside runSDKQueryLoop does neither: line 528 unconditionally writes '300000', and line 538 unconditionally deletes the var — so if a user (or test) had CLAUDE_CODE_STREAM_CLOSE_TIMEOUT set in the host env, that value is lost as a side effect of a stale-resume retry. Mirror the outer save/restore to keep behavior symmetric:
🛠️ Save/restore around the retry query construction
delete sdkOptions.resume;
- process.env.CLAUDE_CODE_STREAM_CLOSE_TIMEOUT = '300000';
-
- let retryInstance;
- try {
- retryInstance = query({ prompt, options: sdkOptions });
- } catch (hookError) {
- console.warn('Retry query with hooks failed:', hookError?.message || hookError);
- delete sdkOptions.hooks;
- retryInstance = query({ prompt, options: sdkOptions });
- }
- delete process.env.CLAUDE_CODE_STREAM_CLOSE_TIMEOUT;
+ const prevRetryTimeout = process.env.CLAUDE_CODE_STREAM_CLOSE_TIMEOUT;
+ process.env.CLAUDE_CODE_STREAM_CLOSE_TIMEOUT = '300000';
+ let retryInstance;
+ try {
+ try {
+ retryInstance = query({ prompt, options: sdkOptions });
+ } catch (hookError) {
+ console.warn('Retry query with hooks failed:', hookError?.message || hookError);
+ delete sdkOptions.hooks;
+ retryInstance = query({ prompt, options: sdkOptions });
+ }
+ } finally {
+ if (prevRetryTimeout !== undefined) {
+ process.env.CLAUDE_CODE_STREAM_CLOSE_TIMEOUT = prevRetryTimeout;
+ } else {
+ delete process.env.CLAUDE_CODE_STREAM_CLOSE_TIMEOUT;
+ }
+ }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@server/claude-sdk.js` around lines 527 - 538, The retry path in
runSDKQueryLoop unconditionally sets and deletes
process.env.CLAUDE_CODE_STREAM_CLOSE_TIMEOUT, clobbering any pre-existing value;
fix by snapshotting the existing value (e.g., const prevStreamTimeout =
process.env.CLAUDE_CODE_STREAM_CLOSE_TIMEOUT) before setting it to '300000',
perform the retry construction (retryInstance = query({ prompt, options:
sdkOptions })) inside try/catch, and then restore the original value in a
finally-style step (if prevStreamTimeout is undefined delete the env var,
otherwise restore process.env.CLAUDE_CODE_STREAM_CLOSE_TIMEOUT =
prevStreamTimeout); apply this around both the initial try and the catch retry
so the env is always restored, referencing the existing retryInstance,
sdkOptions, and query(...) call and mirroring the snapshot/restore logic used by
queryClaudeSDK/prevStreamTimeout.
|
Only users with a collaborator, contributor, member, or owner role can interact with CodeRabbit. |
Summary
When CloudCLI UI has an old session ID in the sidebar that no longer exists in the CLI/SDK (e.g. session was cleaned up by
Claude/Codex), sending a message would fail with an error and give no response to the user.
This change adds automatic fallback for all SDK-based providers:
runSDKQueryLoop, catch "No conversation found" error → remove staleresumeoption → retry as new session with same prompt
resumeThread(sessionId)fails with "not found" error → start a new threadThis is a common issue because:
Testing
if (session && session.cliSessionId)only)... || cursor)The change is minimal and isolated to the affected files:
server/claude-sdk.jsandserver/openai-codex.js. Existing behaviorunchanged for valid sessions.
Summary by CodeRabbit